Before the analysis, I would like to look at the structure of the wine data set, which combines the red wine and white wine data sets.
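A minimal sketch of how such a combined data frame could be assembled is shown here. The toy rows and the object names `red` and `white` are assumptions for illustration only; just the column names and the `color` factor follow the structure printed in this report.

```r
# Toy stand-ins for the two source tables (hypothetical values).
red <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.28),
  quality          = c(5L, 5L, 6L)
)
white <- data.frame(
  fixed.acidity    = c(7.0, 6.3),
  volatile.acidity = c(0.27, 0.30),
  quality          = c(6L, 6L)
)

# Tag each table with its color before stacking the rows.
red$color   <- "red"
white$color <- "white"
wine_data <- rbind(red, white)
wine_data$color <- factor(wine_data$color, levels = c("red", "white"))

str(wine_data)          # structure of the combined frame
table(wine_data$color)  # number of rows per color
```

The output printed in this report comes from the full data set, not from these toy rows.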
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "total.acidity" "quality" "color"
## [16] "quality_level"
## 'data.frame': 6497 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ total.acidity : num 8.1 8.68 8.56 11.48 8.1 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality_level : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
## [1] 6497 16
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol total.acidity quality color
## Min. : 8.00 Min. : 4.110 Min. :3.000 red :1599
## 1st Qu.: 9.50 1st Qu.: 6.710 1st Qu.:5.000 white:4898
## Median :10.30 Median : 7.300 Median :6.000
## Mean :10.49 Mean : 7.555 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.: 8.050 3rd Qu.:6.000
## Max. :14.90 Max. :16.285 Max. :9.000
## quality_level
## Low :2384
## Medium:2836
## High :1277
##
##
##
## [1] 1599 16
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol total.acidity quality color
## Min. : 8.40 Min. : 5.120 Min. :3.000 red :1599
## 1st Qu.: 9.50 1st Qu.: 7.680 1st Qu.:5.000 white: 0
## Median :10.20 Median : 8.445 Median :6.000
## Mean :10.42 Mean : 8.847 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.: 9.740 3rd Qu.:6.000
## Max. :14.90 Max. :16.285 Max. :8.000
## quality_level
## Low :744
## Medium:638
## High :217
##
##
##
## [1] 4898 16
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol total.acidity quality color
## Min. : 8.00 Min. : 4.110 Min. :3.000 red : 0
## 1st Qu.: 9.50 1st Qu.: 6.570 1st Qu.:5.000 white:4898
## Median :10.40 Median : 7.070 Median :6.000
## Mean :10.51 Mean : 7.133 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.: 7.590 3rd Qu.:6.000
## Max. :14.20 Max. :14.470 Max. :9.000
## quality_level
## Low :1640
## Medium:2198
## High :1060
##
##
##
## wine_data$color: red
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## --------------------------------------------------------
## wine_data$color: white
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## wine_data$color: red
## [1] 0.8075694
## --------------------------------------------------------
## wine_data$color: white
## [1] 0.8856386
As the plots above show, the histograms are overlaid with normal distributions generated from the mean and standard deviation of the variable quality. The histograms of both red wine and white wine fit the normal distributions reasonably well. The mode of red wine is at quality 5 and that of white wine is at quality 6. Besides, there are far more white wines than red wines (4898 vs. 1599), so it is not surprising that the histogram of white wine fits better.
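The overlay described above can be sketched in base R as follows. This is only an illustration under assumptions: the quality scores are made-up toy values, and the plotting functions used for the actual figures in this report may differ.

```r
# Toy quality scores (hypothetical values, not the report's data).
quality <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8)

m <- mean(quality)
s <- sd(quality)

pdf(NULL)  # null graphics device, so the sketch also runs without a display
# Density-scaled histogram with one bin per integer score.
hist(quality, breaks = seq(2.5, 8.5, by = 1), freq = FALSE,
     main = "Quality with fitted normal curve", xlab = "quality")
# Normal density that uses the sample mean and standard deviation.
curve(dnorm(x, mean = m, sd = s), from = 2.5, to = 8.5, add = TRUE)
invisible(dev.off())
```

Using `freq = FALSE` puts the histogram on the density scale, which is what makes it comparable to the `dnorm()` curve.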
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.400 7.000 7.215 7.700 15.900
The overall distributions of the wines are similar to a normal distribution. The distribution of white wine is hard to observe in detail, but it is more concentrated. On the other hand, the distribution of red wine is wider and looks right-skewed. According to the summary results, the range of red wine, 4.6 to 15.9 g/dm³, is wider than that of white wine, 3.8 to 14.2 g/dm³. Besides, every summary statistic of red wine is higher than the corresponding one of white wine.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 4.6 7.1 7.9 8.32 9.2 15.9 1.741
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 5.2 7.1 7.7 8.048 8.9 11.8 1.343
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 3.8 6.3 6.8 6.855 7.3 14.2 0.844
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 4.8 6.3 6.8 6.81 7.3 8.8 0.727
As we can see in the plots above, the distribution of red wine does not change much before and after outlier handling. According to the summary statistics, most values change only slightly, but the distribution becomes a little more concentrated, as the standard deviation shows (1.741 vs. 1.343). Additionally, neither histogram of red wine fits the normal distribution well, because both are closer to a right-skewed distribution.
On the other hand, the original data of white wine has more outliers. The range of the x-axis changes considerably, from about 4 to 14 g/dm³ before to about 4 to 9 g/dm³ after. According to the summary statistics, most values do not change much except the maximum. Besides, both histograms of white wine fit the normal distribution well.
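The summary tables above extend the usual `summary()` output with a standard deviation column. A minimal sketch of one way to build such a table per color is given below; the helper name `summary_with_sd` and the toy values are assumptions, not the code used for this report.

```r
# Hypothetical helper: five-number summary plus mean, with SD appended.
summary_with_sd <- function(x) {
  c(unclass(summary(x)), SD = sd(x))
}

# Toy data (hypothetical values).
wine <- data.frame(
  color         = c("red", "red", "red", "white", "white", "white"),
  fixed.acidity = c(7.4, 7.8, 11.2, 7.0, 6.3, 8.1)
)

# One row of statistics per color.
stats <- t(sapply(split(wine$fixed.acidity, wine$color), summary_with_sd))
print(round(stats, 3))
```

Running the same helper on the filtered data gives the "without outliers" rows, so the two tables stay directly comparable.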
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
The overall distribution of the wines is closer to a right-skewed distribution. The overall range is from 0.08 to 1.58 g/dm³. As we can see here, the distribution of white wine is more concentrated than that of red wine.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.12 0.39 0.52 0.528 0.64 1.58 0.179
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.12 0.39 0.52 0.522 0.635 0.98 0.165
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.08 0.21 0.26 0.278 0.32 1.1 0.101
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.08 0.21 0.26 0.264 0.31 0.48 0.076
As we can see, the histograms of red wine and white wine fit the normal distribution well both before and after outlier handling. Both red wine and white wine have large outliers on the right side. After the outlier handling, both distributions are more concentrated, especially that of white wine. On the other hand, it becomes clearer that there are two peaks in the distribution of red wine. According to the summary statistics, most values change too slightly to be notable, but the maximum values change a lot, just as they did for fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600
The distribution of white wine can be thought of as an edge peak distribution: it looks like a normal distribution except that it has a large peak at one tail. On the other hand, the distribution of red wine is more like a plateau distribution: there are many peaks close together, so the top of the distribution resembles a plateau. According to the summary results, the range of white wine, 0.00 to 1.66 g/dm³, is wider than that of red wine, 0.00 to 1.00 g/dm³.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0 0.09 0.26 0.271 0.42 1 0.195
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0 0.08 0.23 0.238 0.38 0.69 0.175
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0 0.27 0.32 0.334 0.39 1.66 0.121
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.12 0.27 0.31 0.321 0.37 0.52 0.081
After outlier handling, we can see that the data of both red wine and white wine had outliers with large values, especially white wine. It is also clearer that the distribution of red wine is a plateau distribution. On the other hand, the distribution of white wine fits the normal distribution well, but there is a clear peak at about 0.5 g/dm³.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
The distributions of both wines are right-skewed, with the counts concentrated in the range from 1.5 to 3.0 g/dm³. According to the summary results, the range of white wine, 0.6 to 65.8 g/dm³, is wider than that of red wine, 0.9 to 15.5 g/dm³. Besides, it is obvious that most white wines have more residual sugar than red wines do.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.9 1.9 2.2 2.539 2.6 15.5 1.41
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 1.2 1.9 2.1 2.128 2.4 3.1 0.375
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.6 1.7 5.2 6.391 9.9 65.8 5.072
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.6 1.8 5.2 6.36 9.6 22 4.908
In the plots, the distribution of red wine after outlier handling becomes very concentrated (from 1.2 to 3.1 g/dm³) and fits the normal distribution very well. For white wine, however, the data has outliers with very large values. After the outlier handling, the range shrinks considerably, but the distribution still fits the normal distribution very badly and remains quite wide (from 0.6 to 22 g/dm³).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
The distributions of both wines are similar to a normal distribution. However, both have a long tail on the right side, which I think can be regarded as outliers. For white wine, most data is concentrated in the range 0.02 to 0.12 g/dm³; for red wine, most data is concentrated in the range 0.06 to 0.16 g/dm³.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.012 0.07 0.079 0.087 0.09 0.611 0.047
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.045 0.069 0.078 0.078 0.086 0.112 0.013
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.009 0.036 0.043 0.046 0.05 0.346 0.022
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.015 0.035 0.042 0.042 0.049 0.07 0.01
After the outliers handling, the distributions of both red wine and white wine fit the normal distribution very well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 29.00 30.53 41.00 289.00
The distribution of white wine is similar to a normal distribution with a long tail on the right side. On the other hand, the distribution of red wine is right-skewed. For white wine, most data is concentrated in the range 0 to 80 mg/dm³; for red wine, most data is concentrated in the range 0 to 50 mg/dm³.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 1 7 14 15.87 21 72 10.46
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 1 8 13 15.08 20 42 8.885
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2 23 34 35.31 46 289 17.007
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2 24 34 34.59 45 78 14.838
Both red wine and white wine have outliers with large values. After the outlier handling, it is clearer that the distribution of red wine is right-skewed. On the other hand, the distribution of white wine fits the normal distribution very well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 77.0 118.0 115.7 156.0 440.0
The distribution of red wine is right-skewed with outliers. On the other hand, the distribution of white wine is bimodal with outliers. According to the summary data, every summary statistic of white wine is higher than the corresponding one of red wine.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 6 22 38 46.47 62 289 32.895
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 6 22 35 41 54 111 24.496
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 9 108 134 138.4 167 440 42.498
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 21 107 132 136.8 166 255 41.045
In these plots, red wine has outliers with very large values. After the outlier handling, the range shrinks to about one third of the original one (from 6-289 to 6-111 mg/dm³), but the shape of the distribution does not change much. On the other hand, the outliers of white wine are not so extreme relative to the new range of the distribution (from about 20 to 260 mg/dm³). Similarly, the distribution of white wine does not change before and after the outlier handling, and both versions fit the normal distribution well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
The distribution of white wine can be thought of as bimodal, and that of red wine is similar to a normal distribution. Most counts of both wines are around 0.995 g/cm³.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.99 0.996 0.997 0.997 0.998 1.004 0.002
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.992 0.996 0.996 0.996 0.997 1.001 0.002
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.987 0.992 0.994 0.994 0.996 1.039 0.003
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.987 0.992 0.994 0.994 0.996 1.002 0.003
In this plot, the data of white wine has outliers with relatively large values. On the other hand, the range of red wine does not change much. After the outlier handling, the distribution of red wine still fits the normal distribution well, but the distribution of white wine becomes slightly right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.110 3.210 3.219 3.320 4.010
The distributions of both wines are similar to a normal distribution, and most data of both wines is concentrated in the range 2.7 to 3.7. However, there are still several outliers above 4.0.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2.74 3.21 3.31 3.311 3.4 4.01 0.154
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2.94 3.24 3.33 3.331 3.41 3.68 0.129
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2.72 3.09 3.18 3.188 3.28 3.82 0.151
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 2.82 3.1 3.18 3.188 3.28 3.55 0.137
The distributions of both red wine and white wine hardly change before and after the outlier handling. They still fit the normal distribution very well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
The distributions of both wines are similar to a normal distribution. However, the distribution of red wine has a long tail on the right side, which I regard as outliers. According to the summary data, every summary statistic of red wine is higher than the corresponding one of white wine.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.33 0.55 0.62 0.658 0.73 2 0.17
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.33 0.55 0.61 0.629 0.7 0.95 0.113
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.22 0.41 0.47 0.49 0.55 1.08 0.114
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 0.22 0.41 0.47 0.478 0.54 0.73 0.094
The data of red wine has outliers with larger values than those of white wine. After the outlier handling, the plot shows that the distribution of red wine is right-skewed. The distribution of white wine does not change much; only some outliers were removed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
The distributions of red wine and white wine are similar. In my opinion, both resemble a right-skewed distribution with noise, especially that of white wine. The highest counts of both red wine and white wine are at about 9.5%.
## Red Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 8.4 9.5 10.2 10.42 11.1 14.9 1.066
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 8.7 9.5 10.1 10.36 11 13.1 0.965
## ==================================================================
## White Wine:
## --------------------------------------------------------
## with outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 8 9.5 10.4 10.51 11.4 14.2 1.231
## --------------------------------------------------------
## without outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD
## 8.4 9.5 10.5 10.6 11.4 14.2 1.214
After the outlier handling, neither the shapes nor the ranges of the distributions change much. They still do not fit the normal distribution well: there are many peaks in both distributions.
The data set contains 1599 records of red wine and 4898 records of white wine, with 13 original variables (including the row index X) and 3 variables that I added: color, quality_level and total.acidity.
As for the features, there is one categorical variable, quality, and the others are numerical variables that describe physical and chemical properties of the wine.
According to the plots, the distributions of most features are similar to a normal or a skewed distribution for both red wine and white wine. The exceptions are total sulfur dioxide and density of white wine, which are bimodal, and citric acid of red wine, which has a plateau distribution.
Most distributions of red wine are similar to those of white wine, except for citric acid and total sulfur dioxide.
Although most distributions of red wine and white wine are similar, the values are quite different. Therefore, it is interesting to ask whether the features that affect the quality of red wine differ from those that affect the quality of white wine.
The main feature in the data set is quality. I would like to determine which features are suitable for predicting the quality of the wine and, furthermore, whether the conclusions for red wine and white wine are similar.
According to the wine quality information document, I think that the main features should be alcohol, free sulfur dioxide and acidity, especially citric acid.
Yes, I created the variable quality_level to generate better plots. It has three values, Low, Medium and High, which are obtained by cutting quality at the 1/3 and 2/3 quantiles of the current data set. Besides, I also created the variable total.acidity, which is the sum of fixed acidity and volatile acidity.
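The two derived variables could be built roughly as follows. This is a sketch under assumptions: the data frame name `wine` and its rows are toy values, and the exact call used for the report is not shown in the text.

```r
# Toy data (hypothetical rows; column names follow the report).
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 5.6),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50, 0.58, 0.62),
  quality          = c(3L, 4L, 5L, 6L, 5L, 5L, 6L, 7L, 7L, 6L, 6L, 8L)
)

# total.acidity is the sum of fixed and volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# quality_level cuts quality at its 1/3 and 2/3 quantiles.
cuts <- quantile(wine$quality, probs = c(1/3, 2/3))
wine$quality_level <- cut(
  wine$quality,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Low", "Medium", "High")
)

table(wine$quality_level)
```

Note that `cut()` requires distinct break points, so this approach only works when the two quantiles differ.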
To analyze the effect of outlier handling, I created three more data frames, wine_data_no_color, wine_data_no_quality_level and wine_data_no_quality, to store the data after outlier handling. These data frames are used in the bivariate and multivariate plots/analysis sections. They differ in the groups used during outlier handling: wine_data_no_color is grouped by the wine's color, wine_data_no_quality_level by quality level and color, and wine_data_no_quality by quality and color. From the beginning, I expected red wine and white wine to be quite different. Therefore, almost all of the data handling was carried out separately for each.
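Group-wise outlier handling of this kind can be sketched as below. The text does not state which rule was applied, so the 1.5 × IQR fences (Tukey's rule) used here are purely an assumption for illustration, as are the helper name `within_fences`, the result name `wine_no_outliers` and the toy values.

```r
# Assumed rule: keep values inside the 1.5 * IQR fences (Tukey's rule).
within_fences <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  x >= q[1] - k * spread & x <= q[2] + k * spread
}

# Toy data (hypothetical values) with one extreme value per color.
wine <- data.frame(
  color          = rep(c("red", "white"), each = 6),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 2.0, 15.5,
                     5.2, 6.1, 4.8, 5.5, 5.0, 65.8)
)

# ave() applies the rule within each color and returns one flag per row.
keep <- as.logical(ave(wine$residual.sugar, wine$color, FUN = within_fences))
wine_no_outliers <- wine[keep, ]

table(wine_no_outliers$color)
```

Grouping by quality level and color instead only changes the grouping arguments passed to `ave()`, which accepts several grouping vectors.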
From my observation, most distributions are similar to a normal distribution, meaning that most values are close to the mean and median. However, two features, residual sugar and total sulfur dioxide, show different distributions. In the plot of residual sugar, the distributions of both red wine and white wine look like the right half of a normal distribution. In the plot of total sulfur dioxide, the distribution of white wine shows two peaks, at 20-30 and 110-120.
Additionally, I analyzed the effect of outlier handling. In this section, I only used wine_data_no_color to compare the original data with the data after outlier handling. I was not too surprised that the distribution of each feature does not change very much. But the outlier handling does help me observe the distribution of each feature and the differences between red wine and white wine.
The above plots show the correlations between the features in the data set, separated by the kind of wine. A darker color means a stronger correlation. Red indicates a positive correlation, whereas blue indicates a negative correlation. The numbers in the boxes are the correlation coefficients.
According to the above plots, it is obvious that the correlation coefficients between pairs of features differ between red wine and white wine.
First, the feature of interest of this report is the variable quality. The influence of most other features on quality is similar for red wine and white wine, especially for alcohol, with 0.5 in red wine and 0.4 in white wine. On the other hand, fixed acidity has a positive correlation with quality, 0.1, in red wine but a negative one, -0.1, in white wine.
Second, I picked several high correlations to plot and classified them into two sections: “The Features Highly Correlated to Density of Wine” and “The Features Highly Correlated to Acidity of Wine”. Besides, I also picked the correlation between total sulfur dioxide and free sulfur dioxide, because free sulfur dioxide is part of total sulfur dioxide. Thus, the plot of total sulfur dioxide vs. free sulfur dioxide should show a good correlation.
## color corr
## 1 red 0.04207544
## 2 white -0.45063122
As we can see, the scatter points of red wine are concentrated on the left side: whatever the alcohol content is, the residual sugar changes only slightly. This is consistent with the correlation coefficient, which is almost 0 (0.04). On the other hand, the scatter points of white wine are spread much wider, and the shape is like a triangle: alcohol tends to decrease with increasing residual sugar. Although the correlation is not very strong (-0.45), the relationship between alcohol and residual sugar of white wine is clearly negative.
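A per-color correlation table like the one above can be computed along these lines. The sketch uses base R and toy values as assumptions; the result name `corr_by_color` is hypothetical.

```r
# Toy data (hypothetical values).
wine <- data.frame(
  color          = rep(c("red", "white"), each = 5),
  alcohol        = c(9.4, 9.8, 10.0, 10.5, 11.2, 8.8, 9.5, 10.1, 11.4, 12.0),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 2.0, 20.7, 12.5, 8.1, 4.2, 1.6)
)

# Pearson correlation of the two variables within each color.
corr_by_color <- do.call(rbind, lapply(split(wine, wine$color), function(d) {
  data.frame(color = d$color[1],
             corr  = cor(d$alcohol, d$residual.sugar))
}))
rownames(corr_by_color) <- NULL
print(corr_by_color)
```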
## Grouped by Wine's Color
## color corr
## -------------------------------
## red 0.122
## white -0.49
## ===============================
## Grouped by Quality Level and Wine's Color
## color corr
## -------------------------------
## red 0.066
## white -0.467
In these plots, we can see that the distribution of red wine becomes very narrow, which means that there were many outliers in the data. There is truly no relation between alcohol and residual sugar in red wine, although the regression line looks quite steep. However, the distribution of scatter points of white wine does not change; only a few outliers have been removed. That means alcohol content and residual sugar do affect each other in white wine.
## color corr
## 1 red 0.6676665
## 2 white 0.6155010
I picked this pair because the correlation coefficients of both red wine and white wine, 0.67 and 0.62 respectively, are quite strong. As I expected, the scatter points of both red wine and white wine show that total sulfur dioxide tends to increase with increasing free sulfur dioxide, although the data of both become more scattered toward the upper right. Most values of white wine are obviously greater than those of red wine.
## Grouped by Wine's Color
## color corr
## -------------------------------
## red 0.629
## white 0.619
## ===============================
## Grouped by Quality Level and Wine's Color
## color corr
## -------------------------------
## red 0.639
## white 0.624
In this plot, the correlation is quite strong, and the distribution of the original data also shows this tendency. The outlier handling only cleans up the data so that the distribution looks clearer. The distribution of white wine looks like an oval, while that of red wine looks like a cone.
                                              Red (±)    / White (±)
------------------------------------------------------------------------------------
Density vs Alcohol                            strong (-) / strong (-)
Density vs Fixed Acidity                      strong (+) / weak (+)
Density vs Residual Sugar                     weak (+)   / strong (+)
====================================================================================
Density vs Total Sulfur Dioxide               no ( )     / weak (+)    <- without outliers handling
Density vs Total Sulfur Dioxide               weak ( )   / strong (+)  <- with outliers handling
====================================================================================
Citric Acid vs pH                             strong (-) / weak (-)
Volatile Acidity vs pH                        weak (+)   / no ( )
Fixed Acidity vs pH                           strong (-) / medium (-)
Volatile Acidity vs Citric Acid               strong (-) / weak (-)
Fixed Acidity vs Citric Acid                  strong (+) / medium (+)
Residual Sugar vs Alcohol                     no ( )     / medium (-)
Total Sulfur Dioxide vs Free Sulfur Dioxide   strong (+) / strong (+)
After outlier handling, most correlations do not change, except the one between density and total sulfur dioxide, which becomes much stronger, up from a weak or even absent relationship.
The feature of interest in this report is quality. According to the correlation plot, there is a strong relationship between quality and alcohol in both red wine and white wine, while the other relationships with quality are weak in both. Interestingly, acidity, especially fixed acidity, behaves differently in red wine and white wine.
I found that acidity and residual sugar influence density differently in red wine and white wine. According to the analysis, acidity, especially fixed acidity, has a greater influence on density in red wine, whereas residual sugar has a greater influence on density in white wine.
Quality and alcohol have the strongest relationship, and density and alcohol are also very strongly related. Given this, acidity might be a secondary factor to consider for the quality of red wine, and residual sugar a secondary factor to consider for the quality of white wine.
Furthermore, the analysis of volatile acidity shows an interesting result. Acidity is usually expected to lower pH. However, in red wine, increasing volatile acidity goes together with increasing pH.
After outlier handling, most correlations between features do not change or change only slightly, and the overall tendencies are the same as in the original data. The shapes of the scatter point distributions simply become clearer. As for the grouping, there is almost no difference between the data sets grouped by color and those grouped by quality level (including color). However, I found one interesting case: the correlation between density and total sulfur dioxide becomes much stronger, up from a weak or even absent relationship. In other words, among the many combinations of features, the outliers do affect the relationship between total sulfur dioxide and density. With the cleaned data, this relationship is much stronger than the original data suggested.
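To illustrate how a few extreme points can hide an otherwise strong linear relationship, here is a minimal sketch on made-up numbers (not the wine data). It uses base R's `boxplot.stats()`, which flags values more than 1.5 * IQR beyond the hinges. The variable names are hypothetical, and this is not the outlier handling code used in this report.

```r
# Toy data (made up for illustration): a perfectly linear relationship
# plus one extreme point in y.
x <- c(1:10, 2)
y <- c(2 * (1:10), 60)

# Pearson correlation on the raw data, including the extreme point.
cor_raw <- cor(x, y)

# boxplot.stats() returns the values lying beyond 1.5 * IQR from the hinges.
out_values <- boxplot.stats(y)$out
keep <- !(y %in% out_values)

# Pearson correlation after dropping the flagged point.
cor_clean <- cor(x[keep], y[keep])

print(c(raw = cor_raw, clean = cor_clean))
```

On this toy data the single flagged point is enough to pull the coefficient well below the value obtained from the remaining ten points, which is the same kind of effect described above for density and total sulfur dioxide.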
In red wine, the strongest relationship is between pH and fixed acidity, with a correlation coefficient of -0.683. In white wine, on the other hand, the strongest relationship is between density and residual sugar, with a correlation coefficient of 0.839.
In red wine, the strongest relationship is between pH and fixed acidity, with a correlation coefficient of -0.673. In white wine, on the other hand, the strongest relationship is between density and residual sugar, with a correlation coefficient of 0.843.
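As a rough sketch of how the strongest pairwise relationship could be located programmatically, the helper below scans a Pearson correlation matrix for the entry with the largest absolute value. The function name `strongest_pair` and the toy data are hypothetical, and this is not necessarily how the values above were obtained.

```r
# Hypothetical helper: return the pair of numeric columns with the largest
# absolute Pearson correlation in a data frame.
strongest_pair <- function(df) {
  num <- df[sapply(df, is.numeric)]
  cm <- cor(num, use = "pairwise.complete.obs")
  cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal
  idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)[1, ]
  data.frame(var1 = rownames(cm)[idx[1]],
             var2 = colnames(cm)[idx[2]],
             corr = cm[idx[1], idx[2]],
             stringsAsFactors = FALSE)
}

# Toy data (made up for illustration, not the wine data).
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2.1, 3.9, 6.2, 7.8, 10.1),
                  c = c(5, 3, 4, 1, 2))
res <- strongest_pair(toy)
print(res)
```

On the wine data, one would presumably call it once per color, for example on `subset(wine, color == "red")`, after dropping the index column `X` (the data frame name `wine` is an assumption).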
## color quality_level corr
## 1 red Low -0.3104089
## 2 red Medium -0.5471226
## 3 red High -0.5841169
## 4 white Low -0.6801808
## 5 white Medium -0.7435118
## 6 white High -0.8436137
In this plot, we can see that the trends in the relationship between density and alcohol are similar for red wine and white wine. Basically, the higher the quality level, the stronger the relationship, especially for white wine at the high quality level, where the correlation coefficient is -0.84. The overall trend also indicates that the relationships in white wine are stronger than those in red wine.
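A table of per-group correlations like the one above could be computed along the following lines. This is a minimal base R sketch under the assumption that the data frame has `color` and `quality_level` columns; the helper name `cor_by_group` and the toy values are hypothetical.

```r
# Hypothetical helper: Pearson correlation of columns x and y within each
# color x quality_level group.
cor_by_group <- function(df, x, y) {
  groups <- split(df, list(df$color, df$quality_level), drop = TRUE)
  do.call(rbind, lapply(groups, function(g) {
    data.frame(color = g$color[1],
               quality_level = g$quality_level[1],
               corr = cor(g[[x]], g[[y]], use = "complete.obs"))
  }))
}

# Toy data (made up for illustration, not the wine data).
toy <- data.frame(
  color = rep(c("red", "white"), each = 6),
  quality_level = rep(rep(c("Low", "High"), each = 3), times = 2),
  alcohol = c(9, 10, 11, 11, 12, 13, 9, 10, 11, 11, 12, 13),
  density = c(0.998, 0.997, 0.996, 0.996, 0.995, 0.993,
              0.999, 0.996, 0.995, 0.994, 0.992, 0.990)
)
out <- cor_by_group(toy, "alcohol", "density")
print(out)
```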
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low -0.454
## Red | Medium -0.586
## | High -0.639
## ----------------------------------------------
## | Low -0.706
## White | Medium -0.817
## | High -0.857
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low -0.305
## Red | Medium -0.572
## | High -0.621
## ----------------------------------------------
## | Low -0.675
## White | Medium -0.814
## | High -0.747
## ===============================================
After outlier handling, the scatter point distributions of the two data sets (the two grouping methods) look similar. However, the tendencies for white wine differ when we focus on the correlation coefficients. For the data grouped by wine’s color, the higher the quality level, the stronger the correlation. For the data grouped by quality level and wine’s color, on the other hand, the correlation at the medium quality level is stronger than at the high quality level.
## color quality_level corr
## 1 red Low 0.4051275
## 2 red Medium 0.3452290
## 3 red High 0.3498892
## 4 white Low 0.8796645
## 5 white Medium 0.8556329
## 6 white High 0.8202080
We can see that residual sugar is strongly related to density in white wine, whereas the relationship is relatively weak in red wine. Besides, the scatter points of red wine are concentrated in the range 1.25 to 2.5 g/dm³, while those of white wine span the range 1.25 to 20 g/dm³. In the plot for white wine, the correlation coefficient decreases slightly with increasing quality level.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.473
## Red | Medium 0.269
## | High 0.351
## ----------------------------------------------
## | Low 0.899
## White | Medium 0.846
## | High 0.832
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.596
## Red | Medium 0.315
## | High 0.367
## ----------------------------------------------
## | Low 0.915
## White | Medium 0.846
## | High 0.644
## ===============================================
After outlier handling, the scatter point distributions of red wine in both data sets become very small, while the distributions of white wine do not change much. The correlation coefficients of red wine do not change much either. For white wine, the results for the data grouped by wine’s color are almost the same as those of the original data. However, for the data grouped by quality level and wine’s color, the correlation coefficient decreases dramatically with increasing quality level. This shows that the relationship between residual sugar and density at the high quality level is not as strong as I thought.
## color quality_level corr
## 1 red Low 0.09424935
## 2 red Medium -0.02603628
## 3 red High 0.07175752
## 4 white Low -0.43353934
## 5 white Medium -0.45499608
## 6 white High -0.48392064
As we know, alcohol is produced by the fermentation of sugar (Wikipedia). In addition, having looked at the relationship between density and alcohol and the one between density and residual sugar, I found that density is strongly related to residual sugar in white wine. Therefore, I would like to look at the relationship between alcohol and residual sugar. As we can see, the relationships are quite weak in red wine: the correlation coefficients at all quality levels are almost 0. In white wine, on the other hand, the relationships are of medium strength, with coefficients of -0.43, -0.45 and -0.48 at the low, medium and high quality levels respectively. In my opinion, this might imply that the fermentation of red wine is almost complete, so that the residual sugar content affects quality only slightly, whereas the degree of fermentation of white wine seems to affect quality much more.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.14
## Red | Medium 0.11
## | High 0.201
## ----------------------------------------------
## | Low -0.467
## White | Medium -0.501
## | High -0.52
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.029
## Red | Medium 0.102
## | High 0.199
## ----------------------------------------------
## | Low -0.47
## White | Medium -0.497
## | High -0.111
## ===============================================
After outlier handling, the scatter point distribution of red wine becomes very narrow, and the correlation coefficients hardly change. Namely, there is no relationship between alcohol and residual sugar in red wine. For white wine, on the other hand, the results of the two grouping methods are quite different. The result for the data grouped by wine’s color shows that outlier handling strengthens the relationship between alcohol and residual sugar in white wine, and that the correlation gets stronger with increasing quality level. However, the result for the data grouped by quality level and wine’s color shows the opposite: the relationship strengthens from the low to the medium quality level, and then weakens dramatically from the medium to the high quality level. This indicates a weak or even no relationship between alcohol and residual sugar in white wine at the high quality level. That also implies that the fermentation is almost complete for white wine at the high quality level.
## color quality_level corr
## 1 red Low 0.6892823
## 2 red Medium 0.7004321
## 3 red High 0.7817219
## 4 white Low 0.1706753
## 5 white Medium 0.2223461
## 6 white High 0.4374344
In contrast to the relationship between density and residual sugar, the relationships between fixed acidity and density in red wine are strong at all quality levels. Likewise, the scatter points of red wine show that fixed acidity tends to increase with increasing density. On the other hand, the relationships in white wine are quite weak: the scatter points of white wine show almost no relationship between fixed acidity and density.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.658
## Red | Medium 0.591
## | High 0.617
## ----------------------------------------------
## | Low 0.082
## White | Medium 0.209
## | High 0.415
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.648
## Red | Medium 0.62
## | High 0.656
## ----------------------------------------------
## | Low 0.085
## White | Medium 0.204
## | High 0.349
## ===============================================
After outlier handling, the scatter point distributions of both red wine and white wine become much smaller. The distributions of red wine look somewhat oval, while those of white wine become quite round. The correlation coefficients tell us that removing the outliers from the red wine data makes the relationship a little weaker, but it makes no difference for the white wine data.
## color quality_level corr
## 1 red Low 0.4507759
## 2 red Medium 0.3745896
## 3 red High 0.5163765
## 4 white Low 0.2339382
## 5 white Medium 0.1021815
## 6 white High 0.1284902
It is reasonable that this plot is similar to that of the relationship between density and fixed acidity, because citric acid is one of the major acids in fixed acidity (Fixed Acidity). However, fixed acidity also includes other acids such as tartaric, malic, and succinic acid, so these relationships are weaker than those between density and fixed acidity.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.423
## Red | Medium 0.268
## | High 0.347
## ----------------------------------------------
## | Low 0.042
## White | Medium 0.045
## | High 0.078
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.349
## Red | Medium 0.313
## | High 0.382
## ----------------------------------------------
## | Low 0.17
## White | Medium 0.045
## | High 0.081
## ===============================================
The results of outlier handling here are almost the same as the previous results (density vs fixed acidity).
## color quality_level corr
## 1 red Low 0.6267012
## 2 red Medium 0.6743878
## 3 red High 0.7452792
## 4 white Low 0.3146377
## 5 white Medium 0.2813673
## 6 white High 0.2540356
This plot shows that citric acid tends to increase with increasing fixed acidity in both red wine and white wine. The difference is that in red wine, the higher the quality level, the stronger the relationship and the higher the correlation coefficient. White wine does not show this phenomenon; there the correlation coefficient instead decreases with increasing quality level.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.615
## Red | Medium 0.619
## | High 0.707
## ----------------------------------------------
## | Low 0.245
## White | Medium 0.29
## | High 0.217
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low 0.55
## Red | Medium 0.648
## | High 0.722
## ----------------------------------------------
## | Low 0.276
## White | Medium 0.299
## | High 0.179
## ===============================================
Similarly, the results of outlier handling here are almost the same as the previous results. However, the scatter point distributions differ slightly between the data grouped by wine’s color and the data grouped by quality level and wine’s color: the distributions for the data grouped by wine’s color are smaller.
## color quality_level corr
## 1 red Low -0.4932947
## 2 red Medium -0.5720388
## 3 red High -0.4947980
## 4 white Low -0.1764287
## 5 white Medium -0.1020559
## 6 white High -0.2356505
According to Wikipedia, citric acid is eventually converted into acetic acid, which is the main compound in volatile acidity. Thus, as we can see in this plot, citric acid tends to decrease with increasing volatile acidity in red wine. However, there is no such tendency in white wine. In my opinion, this might mean that the primary alcoholic fermentation is almost complete in red wine, given the tendency of yeast to convert citric acid into acetic acid.
## ===============================================
## Grouped by Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low -0.557
## Red | Medium -0.67
## | High -0.685
## ----------------------------------------------
## | Low -0.158
## White | Medium -0.095
## | High -0.15
## ==============================================
## Grouped by Quality Level and Wine's Color
## color | quality level corr
## ----------------------------------------------
## | Low -0.53
## Red | Medium -0.657
## | High -0.327
## ----------------------------------------------
## | Low -0.115
## White | Medium -0.079
## | High -0.234
## ===============================================
Outlier handling makes the scatter point distributions of white wine quite round and small, but the correlation coefficients do not change much, which implies that the outliers do not affect the relationship much. For red wine, on the other hand, the results differ considerably between the data grouped by wine’s color and the data grouped by quality level and wine’s color. For the data grouped by wine’s color, the relationships at all quality levels are strengthened by removing the outliers. For the data grouped by quality level and wine’s color, however, the change is quite different: the relationship strengthens from the low to the medium quality level, and then drops at the high quality level to about half of its value at the medium quality level.
In this section, the plots with outlier handling are only used for comparison with the plots without outlier handling. Basically, the box plots with outlier handling just make the tendencies clearer. There is not much change in any of the plots.
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 8.4 9.4 9.7 9.926 10.3 14.9
## 2 red Medium 8.4 9.8 10.5 10.630 11.3 14.0
## 3 red High 9.2 10.8 11.6 11.520 12.2 14.0
## 4 white Low 8.0 9.2 9.6 9.850 10.4 13.6
## 5 white Medium 8.5 9.6 10.5 10.580 11.4 14.0
## 6 white High 8.5 10.7 11.5 11.420 12.4 14.2
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 9.0 9.4 9.60 9.772 10.00 11.5
## 2 red Medium 8.7 9.8 10.50 10.560 11.20 13.3
## 3 red High 9.5 10.9 11.30 11.420 11.92 13.6
## 4 white Low 8.0 9.2 9.55 9.793 10.30 12.6
## 5 white Medium 8.5 9.7 10.50 10.620 11.40 14.0
## 6 white High 8.7 11.0 11.75 11.700 12.50 14.2
The overall tendency in both red wine and white wine is that quality increases with increasing alcohol content. If we focus on the low quality level, there is no relationship in red wine, and quality increases with decreasing alcohol content in white wine. At the medium and high quality levels, on the other hand, quality increases with increasing alcohol content in both red wine and white wine. Besides, red wine and white wine have similar alcohol content.
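The group summaries above can be reduced to a single statistic per group to make the trend easier to compare. The following is a minimal sketch using base R's `aggregate()` on toy values (made up for illustration, not the wine data); the data frame name `toy` is hypothetical.

```r
# Toy data (made up for illustration): three wines per color x quality level.
toy <- data.frame(
  color = rep(c("red", "white"), each = 6),
  quality_level = factor(rep(rep(c("Low", "High"), each = 3), times = 2),
                         levels = c("Low", "High")),
  alcohol = c(9.4, 9.7, 10.3, 10.8, 11.6, 12.2,
              9.2, 9.6, 10.4, 10.7, 11.5, 12.4)
)

# Median alcohol content for every color x quality level combination.
med <- aggregate(alcohol ~ color + quality_level, data = toy, FUN = median)
print(med)
```

Replacing `median` with `summary` would give a fuller table per group, at the cost of a wider result.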
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 1.2 1.9 2.200 2.542 2.60 15.50
## 2 red Medium 0.9 1.9 2.200 2.477 2.50 15.40
## 3 red High 1.2 2.0 2.300 2.709 2.70 8.90
## 4 white Low 0.6 1.7 6.625 7.054 11.02 23.50
## 5 white Medium 0.7 1.7 5.300 6.442 9.90 65.80
## 6 white High 0.8 1.8 3.875 5.262 7.40 19.25
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 1.2 1.9 2.1 2.185 2.5 3.5
## 2 red Medium 1.2 1.9 2.1 2.132 2.4 3.0
## 3 red High 1.4 1.8 2.1 2.184 2.5 3.6
## 4 white Low 0.6 1.8 7.2 7.433 11.7 23.5
## 5 white Medium 0.7 1.7 5.2 6.257 9.6 20.8
## 6 white High 0.8 1.7 2.9 4.071 5.8 14.8
It is obvious that white wine has much more residual sugar than red wine.
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 5.120 7.705 8.39 8.732 9.480 16.26
## 2 red Medium 5.300 7.605 8.40 8.845 9.881 14.61
## 3 red High 5.320 7.780 9.04 9.253 10.490 16.28
## 4 white Low 4.415 6.678 7.16 7.272 7.760 12.03
## 5 white Medium 4.110 6.550 7.03 7.098 7.568 14.47
## 6 white High 4.125 6.505 6.98 6.990 7.482 9.45
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 5.520 7.722 8.340 8.533 9.388 11.67
## 2 red Medium 5.980 7.640 8.285 8.620 9.390 12.68
## 3 red High 6.500 8.210 9.305 9.377 10.560 12.39
## 4 white Low 4.415 6.670 7.100 7.182 7.680 10.47
## 5 white Medium 5.090 6.520 7.000 7.039 7.520 9.04
## 6 white High 5.230 6.490 6.910 6.933 7.350 8.80
In this plot, I created one variable, total acidity, which is calculated as the sum of fixed acidity and volatile acidity. I did not add citric acid because I think citric acid is already one of the compounds in fixed acidity. As we can see, it is obvious that red wine has more acidity than white wine. The interesting thing is that the acidity of white wine at the high quality level is concentrated between about 6.5 and 7.5 g/dm³ (the interquartile range in the summaries above).
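As a small worked example of this definition, the sketch below recomputes total acidity for the first four rows shown in the data structure listing at the start of the report. The data frame name `wine_head` is hypothetical; only the column names and values come from that listing.

```r
# First four rows of the two acidity columns, as printed by str() earlier.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28)
)

# Total acidity is the sum of fixed and volatile acidity;
# citric acid is not added separately.
wine_head$total.acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity
print(wine_head)
```

The results (8.1, 8.68, 8.56, 11.48) agree with the `total.acidity` values in that listing.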
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 6 23.75 45 54.65 78 155
## 2 red Medium 6 23.00 35 40.87 54 165
## 3 red High 7 17.00 27 34.89 43 289
## 4 white Low 9 117.00 149 148.60 182 440
## 5 white Medium 18 107.20 132 137.00 164 294
## 6 white High 34 101.00 122 125.20 146 229
## color quality_level Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 red Low 7 26.00 44.5 53.95 73 147
## 2 red Medium 6 23.25 34.0 38.11 49 94
## 3 red High 7 15.75 24.0 26.14 35 56
## 4 white Low 19 117.00 149.0 149.10 182 260
## 5 white Medium 24 107.00 130.0 135.50 162 248
## 6 white High 53 98.00 116.0 118.00 136 203
According to the wine quality information document, sulfur dioxide prevents microbial growth and the oxidation of wine. As we can see, much more sulfur dioxide is used in white wine to maintain its quality. In particular, the amount of sulfur dioxide used might be controlled within the range 116 to 121 mg/dm³.
The box plots show that white wine has higher residual sugar and total sulfur dioxide. In particular, these quantities have a relatively small range at the high quality level, which in my opinion implies that they are controlled. Red wine, on the other hand, has higher acidity. The interesting thing is that red wine with higher fixed acidity and lower volatile acidity has better quality.
Taking outlier handling into consideration, the conclusions are shown below.
Red wine with higher alcohol content, higher fixed acidity (citric acid) and lower volatile acidity seems to have better quality.
White wine with higher alcohol content and lower residual sugar seems to have better quality at the low and medium quality levels, but there seems to be a weak or even no relationship at the high quality level. Besides, white wine with total sulfur dioxide controlled at about 120 mg/dm³ seems to have better quality.
According to the analysis of the relationships among alcohol, density, residual sugar and acidity, it seems that the fermentation of red wine is almost complete, while the fermentation of white wine is controlled at some level, especially at the low and medium quality levels. When I looked at the relationship between alcohol content and residual sugar, there was no apparent relationship in red wine and a strong negative relationship in white wine. However, in the results with outlier handling, I found a weak or no relationship between alcohol and residual sugar in white wine at the high quality level, which is quite similar to red wine. It might be that the fermentation is almost complete for white wine at the high quality level. Since alcohol is produced by the fermentation of sugar, it is consistent that the box plots also show the alcohol content of white wine to be highest at the high quality level. Additionally, the relationship between citric acid and fixed acidity and the relationship between citric acid and volatile acidity lead to a similar conclusion.
Alcohol content has the strongest relationship with quality of any feature in both red wine and white wine. The overall tendency in both red wine and white wine is that quality increases with increasing alcohol content. However, if we focus on the low quality level, the tendency is not clear in red wine, and in white wine the alcohol content tends to decrease with increasing quality. At the medium and high quality levels, on the other hand, the higher the alcohol content, the higher the quality for both red wine and white wine. Besides, after outlier handling, the higher the quality, the smaller the range of alcohol for both red wine and white wine. However, there is no such phenomenon in red wine at the low quality level.
By combining a scatter plot with density plots of the x- and y-axis variables, it is easier to see the tendency from the low to the high quality level. At the high quality level, both total acidity and alcohol content have a wide range. At the medium quality level, the alcohol content also has quite a wide range, but total acidity does not. At the low quality level, both alcohol content and total acidity are concentrated in a low range. It might be hard to distinguish red wine at the high level from red wine at the medium level, but red wine at the low level has lower total acidity and alcohol content. Thus, in the scatter plot, the data of the high quality level occupy quite a wide range, while the data of the low quality level are concentrated at the bottom. In my opinion, you probably have a high quality red wine if it tastes a little sour and gets you drunk.
Actually, the scatter plot is a little hard to read, but the density plots give us some insight into the data. It is clear that the data of the high quality level are located mostly at the upper left, the data of the medium quality level at the middle left, and the data of the low quality level at the lower left. According to the density plots, the high quality level has a low and narrow range of residual sugar but a high and wide range of alcohol content, which means that alcohol content stays almost the same as residual sugar changes. The medium quality level has a wide range of both residual sugar and alcohol content, so alcohol changes somewhat with residual sugar. The low quality level has a wide and low range of residual sugar but a narrow and low range of alcohol, and the change of alcohol relative to residual sugar is similar to that at the medium quality level. Recalling the plots of alcohol vs residual sugar in the Multivariate section, this plot carries similar information: the fermentation at the high quality level might be almost complete, so that there is no relationship between alcohol and residual sugar, while the fermentation at the medium and low quality levels might still be in progress. This again strengthens my belief that the progress of fermentation might affect the quality of white wine. Personally, I like sweeter wine. But after this analysis, I will try a bitter and strong white wine next time.
I expected red wine and white wine to be quite different, so I decided to explore the data of both at the same time. At the beginning of this project, I struggled with how to make the project meaningful. I decided to look at the histogram of each feature to get a feel for it. Although the distributions of each feature are quite similar for red wine and white wine, I still found that the quantities and tendencies are quite different, as I had expected. Thus, I was not so surprised by the differences between red wine and white wine in the resulting distributions.
In the bivariate plots and analysis section, I struggled with what kind of exploratory data analysis to perform. The results of the ggcorr function helped me a lot: its plot makes it convenient to observe the relationships between the features. I found that red wine and white wine differ considerably in the relationships between the features, especially for residual sugar and acidity. I picked several relationships of interest and checked their scatter point distributions. As a result, I found that residual sugar has a strong influence in white wine, and acidity has a strong influence in red wine.
At the beginning of this report, I created a variable, quality_level, which classifies quality into three levels: low, medium and high. This variable was used in the Multivariate Plots section to check the effect of each feature on quality. As I expected, the tendencies differ between quality levels. That strengthened my confidence that residual sugar has a strong influence in white wine, and acidity has a strong influence in red wine.
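A variable like quality_level can be derived with base R's `cut()`. The sketch below is a guess at the thresholds: the data structure listing at the start of the report only shows that quality 5 maps to Low, 6 to Medium and 7 to High, so treating everything at or below 5 as Low and everything at or above 7 as High is an assumption, not the confirmed definition.

```r
# The first ten quality scores, as printed by str() earlier in the report.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed thresholds: <= 5 is Low, 6 is Medium, >= 7 is High.
# With the default right = TRUE the intervals are (-Inf, 5], (5, 6], (6, Inf).
quality_level <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("Low", "Medium", "High"))

print(table(quality_level))
```

With these breaks, the integer codes of the ten values are 1 1 1 2 1 1 1 3 3 1, which matches the `quality_level` codes shown in that listing.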
I did not investigate all the relationships between the features. However, a low correlation coefficient does not mean that the corresponding feature has no influence on quality. For example, sulphates are an additive that contributes to sulfur dioxide gas, so they should have some influence on white wine. Besides, I found that there are many outliers in different features, and I cleaned the data to make the results more reliable. Additionally, I would like to know the year or vintage of each wine, because the vintage has a great influence on a wine’s quality. This kind of analysis of red and white wine would be interesting future work.
Following the suggestion of the previous reviewer, I carried out the analysis with outlier handling. At the beginning, I was unsure how to perform it. From the first project of this nanodegree, I knew that the IQR rule (flagging values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) is one way to detect outliers. While studying outlier handling, I looked up a lot of information. There are many other ways to detect outliers, such as deviation, regression analysis and Cook’s distance. Besides, I noticed that new outliers appear when I replace the old outliers with NA, because the change in the number of samples results in new outliers. In order to eliminate the outliers, I iterated the outlier handling. Actually, I am not sure whether this is a good idea, because I do not have much experience with it. Also, from what I read, some people replace outliers with a capping value, the mean or the median. The results of outlier handling are really affected by how I group the data during the process. The effect might be hard to observe when the groups are large, such as when grouping by color, but the difference can be seen when the groups get smaller, such as when grouping by quality level or even by quality. Besides, I also found R packages called “outliers” and “mvoutlier”, but I did not use them because this was my first time doing outlier handling. Although it took a lot of time to write the outlier handling code, I learned very much. Of course, I did not refactor my code, and I also found some problems, such as the outliers not being removed completely. I think it would be truly interesting future work to do an analysis focused on outlier handling, including refactoring the code.
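The iteration and the grouping effect described above can be sketched as follows. This is a minimal illustration on made-up numbers, not the code used in this report: the function names `iqr_outliers` and `remove_outliers_iter`, the `max_iter` cap and the toy vector are all assumptions.

```r
# Flag values outside the fences Q1 - k * IQR and Q3 + k * IQR.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Replace flagged values with NA and repeat until nothing new is flagged
# (or until max_iter passes, as a safety cap).
remove_outliers_iter <- function(x, k = 1.5, max_iter = 50) {
  for (i in seq_len(max_iter)) {
    flagged <- iqr_outliers(x, k)
    if (!any(flagged)) break
    x[flagged] <- NA
  }
  x
}

# Toy data (made up for illustration).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20, 50, 60)
g <- rep(c("a", "b"), times = c(8, 3))

n_first_pass <- sum(iqr_outliers(x))                # flagged by a single pass
n_iterated   <- sum(is.na(remove_outliers_iter(x))) # removed after iterating

# Applying the same procedure within each group via ave().
n_grouped <- sum(is.na(ave(x, g, FUN = remove_outliers_iter)))

print(c(first_pass = n_first_pass, iterated = n_iterated, grouped = n_grouped))
```

In this toy vector a single pass flags only 50 and 60; once they are gone, the quartiles shift and 20 is flagged as well. When the same procedure is applied separately to groups "a" and "b", nothing is flagged at all, which mirrors the observation that the result depends on how the data are grouped.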
This was my first time using R for analysis. I spent a lot of time learning R skills, especially plotting. In my opinion, R is truly convenient for exploratory data analysis. It has many powerful functions that improve efficiency. Besides, in order to make the explanations meaningful, I looked up several pieces of information about wine, especially sulfur dioxide, residual sugar and acidity (fixed acidity and volatile acidity).